Abstract

An approach to clustering is presented that 
adapts the basic top-down induction of de- 
cision trees method towards clustering. To 
this aim, it employs the principles of instance 
based learning. The resulting methodology 
is implemented in the TIC (Top down In- 
duction of Clustering trees) system for first 
order clustering. The TIC system employs 
the first order logical decision tree representa- 
tion of the inductive logic programming sys- 
tem Tilde. Various experiments with TIC 
are presented, in both propositional and re- 
lational domains. 



1 INTRODUCTION 

Decision trees are usually regarded as representing the- 
ories for classification. The leaves of the tree contain 
the classes and the branches from the root to a leaf 
contain sufficient conditions for classification. 

A different viewpoint is taken in Elements of Machine
Learning [Langley, 1996]. According to Langley, each
node of a tree corresponds to a concept or a cluster,
and the tree as a whole thus represents a kind of
taxonomy or a hierarchy. Such taxonomies are not only
output by decision tree algorithms but typically also
by clustering algorithms such as COBWEB
[Fisher, 1987]. Therefore, Langley views both
clustering and concept-learning as instantiations of
the same general technique, the induction of concept
hierarchies. The similarity between classification
trees and clustering trees has also been noted by
Fisher, who points to the possibility of using TDIDT
(or TDIDT heuristics) in the clustering context
[Fisher, 1993] and mentions a few clustering systems
that work in a TDIDT-like fashion
[Fisher and Langley, 1985].



* The authors are listed in alphabetical order. 



Following these views we study top-down induction of 
clustering trees. A clustering tree is a decision tree 
where the leaves do not contain classes and where 
each node as well as each leaf corresponds to a cluster. 
To induce clustering trees, we employ principles from 
instance based learning and decision tree induction. 
More specifically, we assume that a distance measure 
is given that computes the distance between two
examples. Furthermore, in order to compute the distance
between two clusters (i.e. sets of examples), we employ
a function that computes a prototype of a set of
examples. A prototype is then regarded as an example,
which allows us to define the distance between two
clusters as the distance between their prototypes. Given
a distance measure for clusters and the view that each 
node of a tree corresponds to a cluster, the decision 
tree algorithm is then adapted to select in each node 
the test that will maximize the distance between the 
resulting clusters in its subnodes. 

Depending on the examples and the distance mea- 
sure employed one can distinguish two modes. In 
supervised learning (as in the classical top-down in- 
duction of decision trees paradigm), the distance mea- 
sure only takes into account the class information of 
each example (see e.g. C4.5 [Quinlan, 1993], CART
[Breiman et al., 1984]). Also, regression trees (SRT
[Kramer, 1996], CART) should be considered supervised
learning. In unsupervised learning, the examples
may not be classified and the distance measure does 
not take into account any class information. Rather, 
all attributes or features of the examples are taken into 
account in the distance measure. 

The Top-down Induction of Clustering trees approach 
is implemented in the TIC system. TIC is a first order
clustering system as it does not employ the classical
attribute value representation but that of first order
logical decision trees as in SRT [Kramer, 1996] and
Tilde [Blockeel and De Raedt, 1998]. So, the clusters
corresponding to the tree will have first order
definitions.
On the other hand, in the current implementation of 
TIC we only employ propositional distance measures. 

Using TIC we report on a number of experiments. 
These experiments demonstrate the power of top-down 
induction of clustering trees. More specifically, we 
show that TIC can be used for clustering, for regres- 
sion, and for learning classifiers. 

This paper significantly expands on an earlier
extended abstract [De Raedt and Blockeel, 1997] in that
TIC now contains a pruning method and also that this 
paper provides new experimental evidence. 

This paper is structured as follows. In Section 2 we 
discuss the representation of the data and the induced 
theories. Section 3 identifies possible applications of 
clustering. The TIC system is presented in Section 
4. In Section 5 we empirically evaluate TIC for the 
proposed applications. Section 6 presents conclusions 
and related work. 

2 THE LEARNING PROBLEM 

2.1 REPRESENTING EXAMPLES 

We employ the learning from interpretations setting 
for inductive logic programming. For the purposes of 
this paper, it is sufficient to regard each example as a 
small relational database, i.e. as a set of facts. Within 
learning from interpretations, one may also specify 
background knowledge in the form of a Prolog program
which can be used to derive additional features of the
examples.^ See [De Raedt and Dzeroski, 1994,
De Raedt, 1996, De Raedt et al., 1998] for more
details on learning from interpretations.

For instance, examples for the well-known mutagenesis
problem [Srinivasan et al., 1996] can be described by
interpretations. Here, an interpretation is simply an
enumeration of all the facts we know about one single
molecule: its class, lumo and logp values, the atoms
and bonds occurring in it, certain high-level
structures, ... We can represent it e.g. as follows:
{logmutag(-0.7), neg, lumo(-3.025), logp(2.29),
atom(d189_1,c,22,-0.11), atom(d189_2,c,22,-0.11),
bond(d189_1,d189_2,7), bond(d189_2,d189_3,7), ...}

Figure 1: A clustering tree (node tests shown:
atom(X,Y,14,Z)? at the root, and atom(U,V,8,W)?)

2.2 FIRST ORDER LOGICAL DECISION 
TREES 

First order logical decision trees are similar to stan- 
dard decision trees, except that the test in each node 
is a conjunction of literals instead of a test on an
attribute. They are always binary, as the test can only
succeed or fail. A detailed discussion of these trees
is beyond the scope of this paper but can be found
in [Blockeel and De Raedt, 1998]. We will use these
trees to represent clustering trees. 

An example of a clustering tree, in the mutagenesis 
context, is shown in Figure 1. Note that in a classical
logical decision tree leaves would contain classes. Here, 
leaves simply contain sets of examples that belong to- 
gether. Also note that variables occurring in tests are 
existentially quantified. The root test, for instance, 
tests whether there occurs an atom of type 14 in the 
molecule. The whole set of examples is thus divided 
into two clusters: a cluster of molecules containing an 
atom 14 and a cluster of molecules not containing any. 

This view is in correspondence with Langley's view- 
point that a test in a node is not just a decision crite- 
rion, but also a description of the subclusters formed in 
this node. In [Blockeel and De Raedt, 1998] we show
how a logical decision tree can be transformed into an 
equivalent logic program, which could alternatively be 
used to sort examples into clusters. The logic pro- 
gram contains invented predicates that correspond to 
the clusters. 



^The interpretation corresponding to each example e is
then the minimal Herbrand model of $B \wedge e$.



2.3 INSTANCE BASED LEARNING AND 
DISTANCES 

The purpose of conceptual clustering is to obtain clus- 
ters such that intra-cluster distance (i.e. the distance 
between examples belonging to the same cluster) is 
as small as possible and the inter-cluster distance (i.e. 
the distance between examples belonging to different
clusters) is as large as possible.

In this paper, we assume that a distance measure $d$
that computes the distance $d(e_1, e_2)$ between
examples $e_1$ and $e_2$ is given. Furthermore, there is
also a need for measuring the distance between
different clusters (i.e. between sets of examples).
Therefore we will assume as well the existence of a
prototype function $p$ that computes the prototype
$p(E)$ of a set of examples $E$. The distance between
two clusters $C_1$ and $C_2$ is then defined as the
distance $d(p(C_1), p(C_2))$ between
the prototypes of the clusters. This shows that the
prototypes should be considered as (possibly) partial 
example descriptions. The prototypes should be suf- 
ficiently detailed as to allow the computation of the 
distances. 

For instance, the distance could be the Euclidean
distance $d_1$ between the values of one or more
numerical attributes, or it could be the distance $d_2$
as measured by a first order distance measure such as
used in RIBL [Emde and Wettschereck, 1996] or KBG
[Bisson, 1992] or [Hutchinson, 1997].



Given the distance at the level of the examples, the 
principles of instance based learning can be used to 
compute the prototypes. E.g. $d_1$ would result in a
prototype function $p_1$ that would simply compute the
mean for the cluster, whereas $d_2$ could result in a
function $p_2$ that would compute the (possibly reduced)
least general generalisation^ of the examples in the
cluster. 

Throughout this paper we employ only propositional 
distance measures and the prototype functions that 
correspond to the instance averaging methods along
the lines of [Langley, 1996]. However, we stress that -
in principle - we could use any distance measure. No- 
tice that although we employ only propositional dis- 
tance measures, we obtain first order descriptions of 
the clusters through the representation of first order 
logical decision trees. 

2.4 PROBLEM-SPECIFICATION 

By now we are able to formally specify the clustering 
problem: 

Given 

• a set of examples E (each example is a set of tuples
in a relational database or equivalently, a set of
facts in Prolog),

• a background theory B in the form of a Prolog 
program, 

• a distance measure d that computes the distance 
between two examples or prototypes, 

• a prototype function p that computes the proto- 
type of a set of examples. 

Find: a first order clustering tree. 

Before discussing how this problem can be solved we 
take a look at possible applications of clustering trees. 

3 APPLICATIONS OF 
CLUSTERING TREES 

Following Langley's viewpoint, a system such as C4.5 
can be considered a supervised clustering system 
where the "distance" metric is the class entropy within 
the clusters: lower class entropy within a cluster
means that the examples in that cluster are more sim- 
ilar with respect to their classes. Since C4.5 employs 
class information, it is a supervised learner. 

Clustering can also be done in an unsupervised manner 
however. When making use of a distance metric to 
form clusters, this distance metric may or may not use 
information about the classes of the examples. Even 
if it does not use class information, clusters may be 
coherent with respect to the class of the examples in 
them. 

This principle leads to a classification technique that 
is very robust with respect to missing class informa- 
tion. Indeed, even if only a small percentage of the 
examples is labelled with a class, one could perform 
unsupervised clustering, and assign to each leaf in the 
concept hierarchy the majority class in that leaf. If 
the leaves are coherent with respect to classes, this 
method would yield relatively high classification accu- 
racy with a minimum of class information available. 
This is quite similar in spirit to Emde's method for 
learning from few classified examples, implemented in 
the COLA system [Emde, 1994].



^Using Plotkin's [Plotkin, 1970] notion of
θ-subsumption or the variants corresponding to structural
matching [Bisson, 1992, De Raedt et al., 1997].



A similar reasoning can be followed for regression, 
leading to "unsupervised regression" ; again this may 
be useful in the case of partially missing information. 

We conclude that clustering can extend classification 
and regression towards unsupervised learning. Another
extension in the predictive context is that clusters
can be used to predict many or all attributes of
an example at once. 
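
As a minimal sketch of predicting all attributes at once, the snippet below assumes propositional examples stored as Python dicts, with `None` marking a missing value. The helper names `leaf_prototype` and `accuracy_per_attribute` are hypothetical and are not taken from TIC. Each leaf predicts, for every attribute, the most frequent known value among its training examples; since missing values are skipped, the same code also covers the case where only a few examples carry a class label.

```python
from collections import Counter


def leaf_prototype(examples):
    """For each attribute, the most frequent known value in the leaf.
    Missing values (None) are ignored, so sparsely labelled attributes
    such as the class can still be predicted."""
    prototype = {}
    attributes = {a for e in examples for a in e}
    for a in attributes:
        known = [e[a] for e in examples if e.get(a) is not None]
        if known:
            prototype[a] = Counter(known).most_common(1)[0][0]
    return prototype


def accuracy_per_attribute(prototype, test_examples):
    """Fraction of test examples (sorted into this leaf) whose known
    value agrees with the leaf's prediction, for each attribute."""
    result = {}
    for a, predicted in prototype.items():
        known = [e[a] for e in test_examples if e.get(a) is not None]
        if known:
            result[a] = sum(v == predicted for v in known) / len(known)
    return result


leaf = [
    {"class": "pos", "roots": 0},
    {"class": None, "roots": 0},   # unlabelled example
    {"class": "pos", "roots": 1},
]
proto = leaf_prototype(leaf)
```

The attribute names `class` and `roots` are placeholders chosen for the illustration.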

Depending on the application one has in mind, mea- 
suring the quality of a clustering tree is done in differ- 
ent ways. For classification purposes predictive accu- 
racy on unseen cases is typically used. For regression 
an often used criterion is the relative error, which is 
the mean squared error of predictions divided by the 
mean squared error of a default hypothesis always pre- 
dicting the mean. This can be extended towards the 
clustering context if a distance measure and prototype 
function are available: 



$$RE = \frac{\sum_{i=1}^{n} d(e_i, \hat{e}_i)^2}{\sum_{i=1}^{n} d(e_i, p)^2}$$

with $e_i$ the examples, $\hat{e}_i$ the predictions and
$p$ the prototype. (A prediction is, just like a
prototype, a partial example description that is
sufficiently detailed to
allow the computation of a distance). 
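
The relative error can be sketched in a few lines, under the assumption that it is the sum of squared distances between examples and their predictions, divided by the sum of squared distances between examples and the overall prototype (the distance-based analogue of the mean squared error ratio described above). The function name `relative_error` and the use of Euclidean distance on numeric tuples are illustrative choices, not TIC's actual code.

```python
import math


def relative_error(examples, predictions, prototype, distance=math.dist):
    """Squared prediction error relative to a default hypothesis that
    always predicts the prototype of the whole data set.
    Undefined (division by zero) if every example equals the prototype."""
    num = sum(distance(e, q) ** 2 for e, q in zip(examples, predictions))
    den = sum(distance(e, prototype) ** 2 for e in examples)
    return num / den


examples = [(0.0,), (2.0,), (4.0,)]
predictions = [(1.0,), (2.0,), (3.0,)]
prototype = (2.0,)  # mean of the examples
re = relative_error(examples, predictions, prototype)
```

A value below 1 means the predictions are closer to the examples than the default prototype is; a value above 1 means they are worse than the default.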

If clustering is considered as unsupervised learning of 
classification or regression trees, the relative error of 
only the predicted variable or the accuracy with which 
the class variable can be predicted is a suitable quality 
criterion. In this case classes should be available for 
the evaluation of the clustering tree, though not during 
(unsupervised) learning. Such an evaluation is often 
done for clusters, see e.g. [Fisher, 1987].



4 TIC: TOP-DOWN INDUCTION 
OF CLUSTERING TREES 

A system for top-down induction of clustering trees 
called TIC has been implemented as a subsystem of
the ILP system Tilde [Blockeel and De Raedt, 1998].
TIC employs the basic TDIDT framework as it is
also incorporated in the Tilde system. The main
point where TIC and Tilde differ from the
propositional TDIDT algorithm is in the computation of
the (first order) tests to be placed in a node, see
[Blockeel and De Raedt, 1998] for details. Furthermore,
TIC differs from Tilde in that it uses other
heuristics for splitting nodes, an alternative stopping 
criterion and alternative tree post-pruning methods. 
We discuss these topics below. 

4.1 SPLITTING 

The splitting criterion used in TIC works as follows. 
Given a cluster $C$ and a test $T$ that will result in two
disjoint subclusters $C_1$ and $C_2$ of $C$, TIC computes
the distance $d(p(C_1), p(C_2))$, where $p$ is the prototype
function. The best test T is then the one that maxi- 
mizes this distance. This reflects the principle that the 
inter-cluster distance should be as large as possible. 
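
The splitting criterion can be sketched as follows. This is a simplified, propositional illustration: examples are numeric tuples, the prototype is the mean, the distance is Euclidean, and candidate tests are plain Python predicates, whereas TIC generates first order tests. The names `mean_prototype` and `best_test` are hypothetical.

```python
import math


def mean_prototype(cluster):
    """Prototype of a cluster of numeric tuples: the component-wise mean."""
    n = len(cluster)
    return tuple(sum(x[i] for x in cluster) / n for i in range(len(cluster[0])))


def best_test(cluster, tests, distance=math.dist, prototype=mean_prototype):
    """Return the (name, distance) of the test whose two subclusters have
    the most distant prototypes; tests that leave a side empty are skipped."""
    best, best_dist = None, -1.0
    for name, test in tests.items():
        yes = [x for x in cluster if test(x)]
        no = [x for x in cluster if not test(x)]
        if not yes or not no:
            continue  # this test does not actually split the cluster
        d = distance(prototype(yes), prototype(no))
        if d > best_dist:
            best, best_dist = name, d
    return best, best_dist


cluster = [(1.0, 0.0), (1.2, 1.0), (5.0, 0.0), (5.2, 1.0)]
tests = {
    "x0 > 3": lambda x: x[0] > 3,
    "x1 > 0.5": lambda x: x[1] > 0.5,
}
name, dist = best_test(cluster, tests)
```

In this toy cluster the test on the first attribute separates the two groups far better (prototype distance about 4.0) than the test on the second attribute (about 1.02), so it is selected.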

If the prototype is simply the mean, then maximiz- 
ing inter-cluster distances corresponds to minimizing 
intra-cluster distances, and splitting heuristics such
as information gain [Quinlan, 1993] or Gini index
[Breiman et al., 1984] can be seen as special cases of
the above principle, as they minimize intra-cluster
class diversity. In the regression context, minimizing
intra-cluster variance (e.g. [Kramer, 1996]) is another
instance of this principle. 
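
The link between inter-cluster and intra-cluster distances for mean prototypes can be made explicit with the standard sum-of-squares decomposition. The identity below is a textbook result stated here as an aid to the reader, under the assumptions of Euclidean distance, mean prototypes $m$, $m_1$, $m_2$ of $C$, $C_1$, $C_2$, and cluster sizes $n_1$, $n_2$; the size weighting $n_1 n_2 / (n_1 + n_2)$ is part of the identity and is not mentioned in the text above.

```latex
\[
\sum_{e \in C} \lVert e - m \rVert^2
  \;=\;
  \underbrace{\sum_{e \in C_1} \lVert e - m_1 \rVert^2
            + \sum_{e \in C_2} \lVert e - m_2 \rVert^2}_{\text{intra-cluster}}
  \;+\;
  \underbrace{\frac{n_1 n_2}{n_1 + n_2}\,
              \lVert m_1 - m_2 \rVert^2}_{\text{inter-cluster}}
\]
```

Since the left-hand side is fixed for a given cluster $C$, any split that increases the size-weighted squared distance between the two means decreases the summed intra-cluster squared distances by the same amount.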

Note that our distance-based approach has the advan- 
tage of being applicable to both numeric and symbolic 
data, and thus generalises over regression and classifi- 
cation. 

4.2 STOPPING CRITERIA 

Stopping criteria are often based on significance 
tests. In the classification context a $\chi^2$-test is
often used to check whether the class distributions in
the subtrees differ significantly [Clark and Niblett,
1989, De Raedt and Van Laer, 1995]. Since regression and
clustering use variance as a heuristic for choosing the
best split, a reasonable heuristic for the stopping
criterion seems to be the F-test. If a set of examples
is split into two subsets, the variance should decrease
significantly, i.e.

$$F = \frac{SS/(n-1)}{(SS_L + SS_R)/(n-2)}$$

should be significantly large ($SS$ is the sum of squared
differences from the mean inside the set of examples,
$SS_L$ and $SS_R$ are the same for the two created subsets
of the examples, n is the total number of examples).^ 
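
A minimal sketch of this stopping heuristic for a single numerical target is given below. It computes the F ratio exactly as defined above and compares it with a critical value supplied by the caller; the parameter `critical_value` and the function names are assumptions made for the illustration, and looking up the critical value for a chosen significance level is left outside the sketch.

```python
def sum_of_squares(values):
    """Sum of squared differences from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def f_statistic(left, right):
    """F = (SS / (n - 1)) / ((SS_L + SS_R) / (n - 2)).
    Undefined (division by zero) if both subsets have zero variance."""
    n = len(left) + len(right)
    ss = sum_of_squares(left + right)
    ss_l, ss_r = sum_of_squares(left), sum_of_squares(right)
    return (ss / (n - 1)) / ((ss_l + ss_r) / (n - 2))


def should_split(left, right, critical_value):
    """Keep the split only if the variance reduction is large enough;
    critical_value is a hypothetical threshold chosen by the caller."""
    return f_statistic(left, right) > critical_value


left, right = [1.0, 2.0, 3.0], [11.0, 12.0, 13.0]
f = f_statistic(left, right)
```

For these two well separated subsets the total sum of squares is 154 and the within-subset sums are 2 each, which gives a large F ratio and therefore a split that would be kept for any moderate threshold.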

4.3 PRUNING USING A VALIDATION 
SET 

The principle of using a validation set to prune trees 
is very simple. After using the training set to build a 
tree, the quality of the tree is computed on the valida- 
tion set (predictive accuracy for classification trees, in- 
verse of relative error for regression or clustering trees). 



^The F-test is only theoretically correct for normally 
distributed populations. Since this assumption may not 
hold, it should here be considered a heuristic for deciding 
when to stop growing a branch, not a real statistical test. 



For each node of the tree the quality Q' of the tree if
it were pruned at that node is compared with the
quality Q of the unpruned tree. If Q' > Q then the
tree is pruned. 
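
The pruning rule can be sketched as follows. This is only an illustration under several assumptions that the text does not fix: nodes are visited bottom-up, quality is classification accuracy on the validation set, and the `Node` class, `predict`, `quality` and `prune` are hypothetical names rather than TIC's implementation.

```python
class Node:
    def __init__(self, prediction, test=None, yes=None, no=None):
        self.prediction = prediction  # prototype or majority class at this node
        self.test, self.yes, self.no = test, yes, no

    def is_leaf(self):
        return self.test is None


def predict(node, example):
    """Sort the example down the tree and return the leaf's prediction."""
    while not node.is_leaf():
        node = node.yes if node.test(example) else node.no
    return node.prediction


def quality(root, validation):
    """Accuracy on a list of (example, label) pairs."""
    return sum(predict(root, x) == y for x, y in validation) / len(validation)


def prune(root, validation):
    """Turn an internal node into a leaf whenever that strictly improves
    the quality of the whole tree on the validation set (Q' > Q)."""
    def visit(node):
        if node.is_leaf():
            return
        visit(node.yes)
        visit(node.no)
        q = quality(root, validation)              # Q: tree as it stands
        saved = (node.test, node.yes, node.no)
        node.test = node.yes = node.no = None      # tentatively prune here
        if quality(root, validation) <= q:         # Q' > Q fails: undo
            node.test, node.yes, node.no = saved
    visit(root)
    return root


inner = Node("a", test=lambda x: x > 2, yes=Node("b"), no=Node("a"))
tree = Node("a", test=lambda x: x > 5, yes=Node("b"), no=inner)
validation = [(1, "a"), (3, "a"), (4, "a"), (7, "b")]
before = quality(tree, validation)
prune(tree, validation)
after = quality(tree, validation)
```

In this toy tree the inner test misclassifies two validation examples, so collapsing it into a leaf raises accuracy, while collapsing the root would lower it and is therefore undone.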

Such a strategy has been successfully followed in 
the context of classification and regression (e.g. 
CART [Breiman et al., 1984]) as well as clustering
(e.g. [Fisher, 1996]). Fisher's method is more
complex than ours in that for each individual variable a
different subset of the original tree will be used for 
prediction. 

In the current implementation of Tilde validation set 
based pruning is available for all settings. For clus- 
tering and regression it is the only pruning criterion 
that is implemented. It is only reliable for reasonably 
large data sets though. When learning from small data 
sets performance decreases because the training set be- 
comes even smaller and with a small validation set a 
lot of pruning is due to random influences. 

5 EXPERIMENTS 
5.1 DATA SETS 

We used the following data sets for our experiments:

• Soybeans: this database
[Michalski and Chilausky, 1980] contains descriptions
of diseased soybean plants. Every plant is described
by 35 attributes. A small data set (46 examples,
4 classes) and a large one (307 examples, 19 classes)
are available at the UCI repository
[Merz and Murphy, 1996].

• Iris: a simple database of descriptions of iris
plants, available at the UCI repository. It contains
3 classes of 50 examples each. There are 4
numerical attributes.

• Mutagenesis: this database [Srinivasan et al., 1996]
contains descriptions of molecules for which the
mutagenic activity has to be predicted. Originally
mutagenicity was measured by a real number, but in
most experiments with ILP systems this has been
discretized into two values (positive and negative).
The database is available at the ILP repository
[Kazakov et al., 1996].
[Srinivasan et al., 1995] introduce four levels of
background knowledge; the first 2 contain only
structural information (atoms and bonds in the
molecules), the other 2 contain higher level
information (attributes describing the molecule as a
whole and higher level submolecular structures).
For our experiments the tests allowed in the
trees can make use of structural information only
(Background 2), though for the heuristics numerical
information from Background 3 can be used.

• Biodegradability: a set of 62 molecules of which
structural descriptions and molecular weights are
given. The biodegradability of the molecules is to
be predicted. This is a real number, but has been
discretized into four values (fast, moderate, slow,
resistant) in most past experiments. The dataset
was provided to us by S. Dzeroski but is not yet
in the public domain.

The data sets were deliberately chosen to include both
propositional and relational data sets. For each
individual experiment the most suitable data sets were
chosen (w.r.t. size, suitability for a specific task, and
relevant results published in the literature).

Distances were always computed from all numerical
attributes, except when stated otherwise. For the
Soybeans data sets all nominal attributes were converted
into numbers first.



5.2 EXPERIMENT 1: PRUNING 

In this first experiment we want to evaluate the effect 
of pruning in TIC on both predictive accuracy and tree 
complexity. We have applied TIC to two databases: 
Soybeans (large version) and Mutagenesis. We chose 
these two because they are relatively large (as noted 
before, the pruning strategy is prone to random influ- 
ences when used with small datasets). 

For both data sets tenfold crossvalidations were per- 
formed. In each run the algorithm divides the learning 
set into a training set and a validation set. Clustering
trees are built and pruned in an unsupervised manner. 
The clustering hierarchy before and after pruning is 
evaluated by predicting the class of each test example. 

In Figure 2, the average accuracy of the clustering
hierarchies before and after pruning is plotted against
the size of the validation set (this size is a parameter
of TIC), and the same is done for the tree complexity.
The same results for the Mutagenesis database are
summarised in Figure 3.

From the Soybeans experiment it can be concluded
that TIC's pruning method results in a slight decrease
in accuracy but a large decrease in the number of
nodes. The pruning strategy seems relatively stable
w.r.t. the size of the validation set. The Mutagenesis
experiment confirms these findings (though the
decrease in accuracy is less clear here).

Figure 2: Soybeans: a) Accuracy before and after
pruning; b) number of nodes before and after pruning

5.3 EXPERIMENT 2: COMPARISON 
WITH OTHER LEARNERS 

In this experiment we compare TIC with propositional 
clustering systems and with classification and regres- 
sion systems. A comparison with propositional cluster- 
ing systems is hard to make because few quantitative 
results are available in the literature; therefore we also
compare with supervised learners. 

We applied TIC to the Soybean (small) and Iris
databases, performing tenfold crossvalidations.
Learning is unsupervised, but classes are assumed to be
known at evaluation time (the class of a test example
is compared with the majority class of the leaf
the example is sorted into). Table 1 compares the
results with those obtained with the supervised learner
Tilde. 

We see that TIC obtains high accuracies for these
problems. The only clustering result we know of is
for COBWEB, which obtained 100% on the Soybean
data set. This difference is not significant. Tilde's
accuracies don't differ much from those of TIC, which
induced the hierarchy without knowledge of the classes.
Tree sizes are smaller though.

Figure 3: Mutagenesis: Accuracy and size of the
clustering trees

            TIC                   Tilde
Database    acc.    tree size     acc.    tree size
Soybean     97%     3.9 nodes     100%    3 nodes
Iris        92%     15 nodes      94%     4 nodes

Table 1: Comparison of TIC with a supervised learner
(averages over 10-fold crossvalidation).

We have also performed an experiment on the
Biodegradability data set, predicting numbers. For
this dataset the F-test stopping criterion was used
(significance level 0.01), but no validation set was
used given the small size of the data set. The distance
used is the difference between class values. Table 2
compares TIC's performance with Tilde's
(classification, leave-one-out) and SRT's (regression,
sixfold).

Our conclusions are that a) for unsupervised learning
TIC performs almost as well as other unsupervised or
supervised learners, if classification accuracy is
measured; and b) while there is clearly room for
improvement with respect to using TIC for regression,
post-discretization of the regression predictions shows
that this approach is competitive with classical
approaches to classification.

l.o.o.    Tilde    classification             acc. = 0.532
l.o.o.    TIC      regression                 RE = 0.740
l.o.o.    TIC      classif. via regression    acc. = 0.565
6-fold    SRT      regression                 RE = 0.34
6-fold    TIC      regression                 RE = 1.13

Table 2: Comparison of regression and classification
on the biodegradability data (l.o.o.=leave-one-out).

5.4 EXPERIMENT 3: PREDICTING
MULTIPLE ATTRIBUTES

Clustering allows us to predict multiple attributes.
Since examples in a leaf must resemble each other as
much as possible, attributes must also agree as much as
possible.

By sorting unseen examples down a cluster tree and
comparing all attributes of the example with the
prototype attributes, we get an idea of how good the
tree is. This is an extension of the classical
evaluation, as each attribute in turn is a class now.

We did a tenfold crossvalidation for the following
experiment: using the training set a clustering tree is
induced. Then, all examples of the test set are sorted
in this hierarchy, and the prediction for all of their
attributes is evaluated. For each attribute, the value
that occurs most frequently in a leaf is predicted for
all test examples sorted in that leaf.

We used the large soybean database, with pruning.
Table 3 summarizes the accuracies obtained for each
attribute and compares with the accuracy of majority
prediction. The high accuracies show that most
attributes can be predicted very well, which means
the clusters are very coherent. The mean accuracy of
81.6% does not differ significantly from the 83 ± 2%
reported in [Fisher, 1996].

attribute        range    majority    TIC
[...]            0-3      62.7%       91.2%
fruit_spots      0-4      53.4%       87.0%
seed             0-1      73.9%       85.7%
mold_growth      0-1      80.5%       86.6%
seed_discolor    0-1      79.5%       84.0%
seed_size        0-1      81.8%       88.6%
shriveling       0-1      83.4%       87.9%
roots            0-2      84.7%       95.8%
mean                                  81.6%

Table 3: Prediction of all attributes together in the
Soybean data set

5.5 EXPERIMENT 4: HANDLING
MISSING INFORMATION

It can be expected that clustering, making use of more
attributes than just class attributes, is more robust
with respect to missing values. We showed in
Experiment 2 that unsupervised learners (where the
heuristics do not use any class information at all) can
yield trees with predictive accuracies close to those of
supervised learners, but all class information was still
available for assigning classes to leaves after the tree
was built.

In this experiment, we measure the predictive accuracy
of trees when class information as well as other
information may be missing, not only for learning, but
also for assigning classes to leaves afterwards, and this
for several levels of missing information. Our aim is to
investigate how predictive accuracy deteriorates with
missing information, and to compare clustering systems
that use only class information with systems that
use more information.

We have used the Mutagenesis data set for this
experiment (for each example, there was a fixed
probability that the value of a certain attribute was
removed from the data; this probability was increased
for consecutive experiments), comparing the use of only
class information (logmutag) with the use of three
numerical variables (among which the class) for
computing distances. This experiment is similar in
spirit to the ones performed with COLA [Emde, 1994].
Table 4 shows the results. As expected, performance
degrades less quickly when more information is
available, which supports the claim that the use of more
than just class information can improve performance in
the presence of missing information.

available numerical data    logmutag    all three
100%                        0.80        0.81
50%                         0.78        0.79
25%                         0.72        0.77
10%                         0.67        0.74

Table 4: Classification accuracies obtained for
Mutagenesis with several distance functions, and on
several levels of missing information.



6 CONCLUSIONS AND RELATED 
WORK 

We have presented a novel first order clustering sys- 
tem TIC within the TDIDT class of algorithms. TIC 
integrates ideas from concept-learning (TDIDT), from 
instance based learning (the distances and the pro- 
totypes), and from inductive logic programming (the 
representations) to obtain a clustering system. Several 
experiments were performed that illustrate the type of 
tasks TIC is useful for. 

As far as related work is concerned, our work is
related to KBG [Bisson, 1992], which also performs first
order clustering. In contrast to the current version of 
TIC, KBG does use a first order similarity measure, 
which could also be used within TIC. Furthermore, 
KBG is an agglomerative (bottom-up) clustering algo- 
rithm and TIC a divisive one (top-down). The divi- 
sive nature of TIC makes TIC as efficient as classical 
TDIDT algorithms. A final difference with KBG is 
that TIC directly obtains logical descriptions of the 
clusters through the use of the logical decision tree 
format. For KBG, these descriptions have to be de- 
rived in a separate step because the clustering process 
only produces the clusters (i.e. sets of examples) and 
not their description. 



The instance-based learner RIBL
[Emde and Wettschereck, 1996] uses an advanced first
order distance metric that might be a good candidate 
for incorporation in TIC. 



While [Fisher, 1993] first made the link between
TDIDT and clustering, our work is inspired mainly
by [Langley, 1996]. From this point of view, our
work is closely related to SRT [Kramer, 1996], which
builds regression trees in a supervised manner. TIC
can be considered a generalization of SRT in that
TIC can also build trees in an unsupervised manner,
and can predict multiple values. Finally,
we should also refer to a number of other
approaches to first order clustering, which include
Kluster [Kietz and Morik, 1994],
[Yoo and Fisher, 1991],
[Thompson and Langley, 1991] and
[Ketterlin et al., 1995].



Future work on TIC includes extending the system so 
that it can employ first order distance measures, and 
investigating the limitations of this approach (which 
will require further experiments). 